BG-HOP: A Bimanual Generative Hand-Object Prior
Krishna, Sriram, Chittupalli, Sravan, Park, Sungjae
In this work, we present BG-HOP, a generative prior that seeks to model bimanual hand-object interactions in 3D. We address the challenge of limited bimanual interaction data by extending existing single-hand generative priors, demonstrating preliminary results in capturing the joint distribution of hands and objects. Our experiments showcase the model's capability to generate bimanual interactions and synthesize grasps for given objects. We make code and models publicly available.
EgoSurgery-HTS: A Dataset for Egocentric Hand-Tool Segmentation in Open Surgery Videos
Darjana, Nathan, Fujii, Ryo, Saito, Hideo, Kajita, Hiroki
Egocentric open-surgery videos capture rich, fine-grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel-level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions. We introduce EgoSurgery-HTS, a new dataset with pixel-wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open-surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand-tool segmentation to label hands and the tools they manipulate. Using EgoSurgery-HTS, we conduct extensive evaluations of state-of-the-art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand-tool segmentation in egocentric open-surgery videos compared to existing datasets. The dataset will be released at https://github.com/Fujiry0/EgoSurgery.
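Instance-segmentation benchmarks like this one are typically scored by matching predicted masks to ground truth via intersection-over-union; the abstract does not specify the exact protocol, so the sketch below shows only the standard mask-IoU computation on small binary masks, with all data made up:

```python
def mask_iou(a, b):
    """IoU of two same-sized binary masks given as lists of 0/1 rows,
    the usual matching criterion for instance segmentation."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0

# Toy 2x3 masks: a ground-truth hand region vs. a prediction.
hand = [[0, 1, 1],
        [0, 1, 1]]
pred = [[0, 0, 1],
        [0, 1, 1]]
print(mask_iou(hand, pred))  # 0.75
```

A full benchmark would compute this per predicted/ground-truth instance pair and aggregate into average precision, but that aggregation is dataset-specific.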
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Cai, Mu, Tan, Reuben, Zhang, Jianrui, Zou, Bocheng, Zhang, Kai, Yao, Feng, Zhu, Fangrui, Gu, Jing, Zhong, Yiwu, Shang, Yuzhang, Dou, Yao, Park, Jaden, Gao, Jianfeng, Lee, Yong Jae, Yang, Jianwei
Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and fall short of evaluating models' temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, and event order. Moreover, it enables evaluations across tasks (both video question answering and captioning), video lengths (both short and long), and model families (multimodal video embedding models as well as text generation models). Results show that state-of-the-art models like GPT-4o achieve only 38.5% question-answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall in multi-choice QA: LLMs can detect the subtle changes in negative captions and use a centralized description as a cue for their predictions, so we propose Multiple Binary Accuracy (MBA) to correct this bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both dataset and evaluation code will be made available.
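The Multiple Binary Accuracy idea described above can be sketched as follows: instead of one multi-choice question, the positive caption is compared against each negative caption in its own binary judgment, and an item counts as correct only if every pair is judged correctly. The data layout and scoring rule here are an assumption based on the abstract, not the paper's code:

```python
def multiple_binary_accuracy(items):
    """Each item holds a model score for the positive (correct) caption
    and scores for each negative caption. The item counts as correct
    only if the positive outscores every negative in its own pairwise
    comparison, removing the multi-choice cue of picking the most
    'central' option from a list."""
    correct = 0
    for item in items:
        pos = item["positive_score"]
        if all(pos > neg for neg in item["negative_scores"]):
            correct += 1
    return correct / len(items) if items else 0.0

# Hypothetical scores for two video QA items.
items = [
    {"positive_score": 0.9, "negative_scores": [0.2, 0.4]},  # all pairs right
    {"positive_score": 0.5, "negative_scores": [0.6, 0.1]},  # one pair wrong
]
print(multiple_binary_accuracy(items))  # 0.5
```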
MM-Ego: Towards Building Egocentric Multimodal LLMs
Ye, Hanrong, Zhang, Haotian, Daxberger, Erik, Chen, Lin, Lin, Zongyu, Li, Yanghao, Zhang, Bowen, You, Haoxuan, Xu, Dan, Gan, Zhe, Lu, Jiasen, Yang, Yinfei
This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. This is currently the largest egocentric QA dataset. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel "Memory Pointer Prompting" mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that utilizes the key visual information to generate responses. This enables the model to more effectively comprehend extended video content. With the data, benchmark, and model, we successfully build MM-Ego, an egocentric multimodal LLM that shows powerful performance on egocentric video understanding.
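Reading only from the abstract, the "Memory Pointer Prompting" mechanism amounts to a glimpse step that scores every frame for relevance, followed by a fallback step that answers using just the selected key frames. A toy sketch under that reading (function names and scores are invented, not the paper's implementation):

```python
def memory_pointer_answer(frames, relevance, answer_with, k=2):
    """Global glimpse: rank all frames by a relevance score.
    Fallback: answer using only the top-k key frames, kept in
    temporal order. `relevance` stands in for whatever the model's
    glimpse pass would produce."""
    ranked = sorted(range(len(frames)), key=lambda i: relevance[i], reverse=True)
    key_frames = [frames[i] for i in sorted(ranked[:k])]  # restore temporal order
    return answer_with(key_frames)

frames = ["f0", "f1", "f2", "f3"]
relevance = [0.1, 0.9, 0.2, 0.8]  # hypothetical glimpse scores
print(memory_pointer_answer(frames, relevance, lambda ks: "+".join(ks)))  # f1+f3
```

The point of the two-step structure is that the expensive answering pass sees only a handful of frames, which is what makes hour-long video tractable.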
Words2Contact: Identifying Support Contacts from Verbal Instructions Using Foundation Models
Totsila, Dionis, Rouxel, Quentin, Mouret, Jean-Baptiste, Ivaldi, Serena
This paper presents Words2Contact, a language-guided multi-contact placement pipeline leveraging large language models and vision language models. Our method is a key component for language-assisted teleoperation and human-robot cooperation, where human operators can instruct the robots where to place their support contacts before whole-body reaching or manipulation using natural language. Words2Contact transforms the verbal instructions of a human operator into contact placement predictions; it also deals with iterative corrections, until the human is satisfied with the contact location identified in the robot's field of view. We benchmark state-of-the-art LLMs and VLMs of different sizes on contact prediction performance. We demonstrate the effectiveness of the iterative correction process, showing that even naive users quickly learn how to instruct the system to obtain accurate locations. Finally, we validate Words2Contact in real-world experiments with the Talos humanoid robot, instructed by human operators to place support contacts on different locations and surfaces to avoid falling when reaching for distant objects.
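The iterative correction loop described above can be sketched as a simple refinement process: an initial instruction yields a contact prediction, and each follow-up utterance updates it until the operator stops correcting. The `predict` interface and the toy vocabulary below are assumptions for illustration, not the paper's API:

```python
def refine_contact(predict, instructions):
    """Run the first instruction with no prior, then fold each verbal
    correction into the current prediction. `predict` stands in for the
    LLM/VLM pipeline that maps (utterance, previous contact) -> contact."""
    contact = predict(instructions[0], prior=None)
    for correction in instructions[1:]:
        contact = predict(correction, prior=contact)
    return contact

# Toy predictor over a 2-D surface: the first utterance names a region,
# later ones nudge the previous point (all names and coordinates invented).
REGIONS = {"table edge": (0.8, 0.2), "wall": (0.0, 0.9)}
MOVES = {"left": (-0.1, 0.0), "right": (0.1, 0.0), "up": (0.0, 0.1), "down": (0.0, -0.1)}

def toy_predict(text, prior):
    if prior is None:
        return REGIONS[text]
    dx, dy = MOVES[text]
    return (round(prior[0] + dx, 3), round(prior[1] + dy, 3))

print(refine_contact(toy_predict, ["table edge", "left", "up"]))  # (0.7, 0.3)
```

In the real system the loop would terminate on the operator's confirmation rather than on a fixed instruction list.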
Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning
Chen, Yuanpei, Wu, Tianhao, Wang, Shengjie, Feng, Xidong, Jiang, Jiechuang, McAleer, Stephen Marcus, Geng, Yiran, Dong, Hao, Lu, Zongqing, Zhu, Song-Chun, Yang, Yaodong
Achieving human-level dexterity is an important open problem in robotics. However, tasks of dexterous hand manipulation, even at the baby level, are challenging to solve through reinforcement learning (RL). The difficulty lies in the high degrees of freedom and the required cooperation among heterogeneous agents (e.g., joints of fingers). In this study, we propose the Bimanual Dexterous Hands Benchmark (Bi-DexHands), a simulator that involves two dexterous hands with tens of bimanual manipulation tasks and thousands of target objects. Specifically, tasks in Bi-DexHands are designed to match different levels of human motor skills according to the cognitive science literature. We built Bi-DexHands in Isaac Gym; this enables highly efficient RL training, reaching 30,000+ FPS on a single NVIDIA RTX 3090. We provide a comprehensive benchmark for popular RL algorithms under different settings, including single-agent/multi-agent RL, offline RL, multi-task RL, and meta RL. Our results show that on-policy algorithms of the PPO type can master simple manipulation tasks equivalent to the skills of human babies up to 48 months old (e.g., catching a flying object, opening a bottle), while multi-agent RL can further help to master manipulations that require skilled bimanual cooperation (e.g., lifting a pot, stacking blocks). Despite the success on each single task, when it comes to acquiring multiple manipulation skills, existing RL algorithms fail in most of the multi-task and few-shot learning settings, which calls for more substantial development from the RL community.
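The multi-agent framing above, in which each hand is its own agent acting over tens of joints, can be illustrated with a self-contained stub environment and a random bimanual policy. Everything here (class name, joint counts, reward) is invented for illustration and is not the Bi-DexHands API:

```python
import random

class TwoHandEnvStub:
    """Toy stand-in for a bimanual manipulation task: two agents
    ('left' and 'right'), each emitting one continuous action per joint.
    The reward simply penalizes actuation effort."""
    def __init__(self, joints_per_hand=24, horizon=8):
        self.joints = joints_per_hand
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return {"left": [0.0] * self.joints, "right": [0.0] * self.joints}

    def step(self, actions):
        self.t += 1
        # Observation: joint targets clipped to the actuator range.
        obs = {hand: [max(-1.0, min(1.0, a)) for a in act]
               for hand, act in actions.items()}
        reward = -sum(abs(a) for act in actions.values() for a in act)
        return obs, reward, self.t >= self.horizon

# Rollout with a random policy: one action vector per hand per step.
rng = random.Random(0)
env = TwoHandEnvStub()
obs, done, total = env.reset(), False, 0.0
while not done:
    actions = {hand: [rng.uniform(-1, 1) for _ in range(env.joints)]
               for hand in obs}
    obs, reward, done = env.step(actions)
    total += reward
```

In the benchmark proper, the single-agent setting would concatenate both hands into one action space, while the multi-agent setting keeps this per-hand factorization and lets algorithms like MAPPO coordinate the two agents.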
The Strange Friendships of Ursula K. Le Guin's "The Left Hand of Darkness"
I never met Ursula K. Le Guin, who died on January 22, 2018, at the age of eighty-eight, in Portland, Oregon, her home for many years. And yet we became good friends during the last two months of her life, entirely by way of e-mail. I inaugurated the correspondence on November 21, 2017, and she replied on November 24th: "One of the things I like least about being very old is the unreliability of my energy. Working at poetry or a story is, always has been, the job I want to be doing, the work that keeps me steady and content."
Silicon Valley Event On Machine Learning Tackles The Latest Riddles Vexing AI Self-Driving Cars
There's a child's riddle that asks you to indicate what can be held in your left hand and yet cannot be held in your right hand. Take a moment to ponder this riddle. Your first thought might be that anything that could be held in your left hand should also be able to be held in your right hand, assuming of course that there's no trickery involved. One trick might be that you could hold your right hand in your left hand, but that you cannot presumably "hold" your right hand in your right hand since your right hand is your right hand. Another trick might be that your right hand is perchance weaker than your left hand, thus if an object was heavy, potentially you could hold it in your left hand, but you could not do so with your less powerful right hand. If we eliminate all the trickery potential answers, what else remains?
Schoolgirl, 11, gets her first bionic arm fitted after being born without a right hand
A schoolgirl has had her first ever bionic arm fitted after being born without a right hand. Hollie Lownds, 11, was fitted with the 'Iron Man'-themed bionic arm, which is worth £5,000, in September. It will allow her to brush her hair, eat with a knife and fork, and ride a bike for the first time, and she is particularly excited to open Christmas presents with two hands. Hollie's parents were told 20 weeks into the pregnancy that their daughter was missing her right hand because of a growth defect, but the cause wasn't clear to doctors. Since she was born she hasn't had a prosthetic arm and has tried to use the stump of her elbow joint to grasp things and open doors.